Abstract

Within the framework of on-line learning, we study the generalization error of an ensemble
learning machine learning from a linear teacher perceptron. The generalization error achieved by
an ensemble of linear perceptrons having homogeneous or inhomogeneous initial weight vectors is
calculated exactly at the thermodynamic limit of a large number of input elements and shows rich
behavior. Our main findings are as follows. For learning with homogeneous initial weight vectors,
the generalization error using an infinite number of linear student perceptrons is equal to only half
that of a single linear perceptron, and for a finite number $K$ of linear perceptrons it converges
to that of the infinite case as $O(1/K)$. For learning with inhomogeneous initial weight vectors, it is
advantageous to use a weighted average over the outputs of the linear perceptrons,
and we show the conditions under which the optimal weights are constant during the learning
process. The optimal weights depend only on the correlation of the initial weight vectors.
I. INTRODUCTION

Many ensemble learning algorithms, such as bagging and the Ada-boost [3] algorithm,
try to improve upon the performance of a single learning machine by using many learning
machines; such an approach has recently received considerable attention in the field of
machine learning.

Theoretical analysis of the generalization error of ensemble learning has been done using
statistical mechanics. Sollich analyzed batch-mode ensemble learning with linear perceptrons
under the noisy learning condition, and demonstrated that the generalization
error can be reduced by applying different subsets of the entire set of learning examples. Urbanczik
analyzed the generalization error of ensemble learning with simple perceptrons based on
on-line learning. He discussed two types of ensemble learning with $K$ simple perceptrons.
In the first scenario, a new simple perceptron obtained by averaging the weight vectors of
the $K$ simple perceptrons was used as the ensemble learning machine (model 1). In the second,
the average of the outputs of the $K$ simple perceptrons was used as the output of the
ensemble learning machine (model 2). Since the output of a simple perceptron is non-linear,
theoretically obtaining the generalization error for model 2 requires a number of calculations
that grows with $K$, the number of simple perceptrons. Urbanczik mainly discussed model 1 to avoid
this difficulty and demonstrated that in the limit of $K \to \infty$, the generalization error of
model 2 converges to that of model 1.

When the linear perceptron is employed, models 1 and 2 are identical, and analysis for
finite $K$ becomes possible. This is the main reason we employ linear perceptrons
in this paper. We assume two initial conditions to calculate the generalization error of the
ensemble learning machine: one is that the correlation between the weight vectors of learning
machines is homogeneous, and the other is that the correlation is inhomogeneous. In the
homogeneous case, the correlation between the weight vectors of the learning machines is
uniform and remains uniform throughout the learning process because
of the symmetry of the evolution equation used as the update rule. Thus, a simple average
of the $K$ outputs of the learning machines can be used as the ensemble output.
We derive the generalization error and find that it consists of two terms: the
first depends on the number of learning machines $K$, while the second does not.
We confirm that the generalization error is equal to half that of a single learning
machine when $K \to \infty$, and that for finite $K$ it converges to that of the infinite
case as $O(1/K)$.

FIG. 1: Network structure of teacher and student networks, all having the same network structure.

In the inhomogeneous case, the generalization error can be improved by introducing
weights to average the $K$ outputs of the learning machines (i.e., using a weighted average
rather than a simple average) and adapting the weights to minimize the generalization error
(i.e., parallel boosting). We also analyze the time dependence of the weights
used for averaging.

II. MODEL 

A. Network structure and ensemble output 

In this paper, we discuss the generalization error of ensemble learning with $K$ linear
perceptrons. We assume the teacher and student networks receive an $N$-dimensional input
$\mathbf{x} = (x_1, \ldots, x_N)$ as shown in Fig. 1 and consider the thermodynamic limit of $N \to \infty$.
We also assume that the elements $x_i$ of the independently drawn input $\mathbf{x}$ are uncorrelated
random variables with zero mean and variance $1/N$; that is, the elements are drawn from a
probability distribution $P(\mathbf{x})$. The size of $\mathbf{x}$ is then $|\mathbf{x}| = 1$.

$$\langle x_i \rangle = 0, \quad \langle (x_i)^2 \rangle = \frac{1}{N}, \quad |\mathbf{x}| = 1, \tag{1}$$
where $\langle \cdots \rangle$ means averaging over the distribution $P(\mathbf{x})$.
The teacher is a linear simple perceptron (as shown in Fig. 1) that outputs $v(m)$ for the $N$-dimensional
input $\mathbf{x}(m) = (x_1(m), \ldots, x_N(m))$ at the $m$th learning iteration.

$$v(m) = \sum_{i=1}^{N} B_i x_i(m) = \mathbf{B} \cdot \mathbf{x}(m), \tag{2}$$
$$\mathbf{B} = (B_1, \ldots, B_N). \tag{3}$$

Each element $B_i$ of the teacher weight vector $\mathbf{B}$ is drawn from a probability distribution
of zero mean and unit variance, and is fixed throughout the learning process. In this case,
the size of the teacher weight vector is $\sqrt{N}$,

$$\langle B_i \rangle = 0, \quad \langle (B_i)^2 \rangle = 1, \quad |\mathbf{B}| = \sqrt{N}. \tag{4}$$

$K$ linear simple perceptrons are used as the student networks that compose the ensemble
learning machine. Each student network has the same architecture as the teacher network
and outputs $u_k(m)$ for the $N$-dimensional input $\mathbf{x}(m)$.

$$u_k(m) = \sum_{i=1}^{N} J_i^k(m) x_i(m) = \mathbf{J}^k(m) \cdot \mathbf{x}(m), \tag{5}$$
$$\mathbf{J}^k(m) = (J_1^k(m), \ldots, J_N^k(m)), \tag{6}$$

where $u_k(m)$ denotes the output of the $k$th student and $\mathbf{J}^k(m)$ denotes the weight vector of the $k$th student.
The student weight vector changes through the learning process, so the size of $\mathbf{J}^k(m)$ is
written as $|\mathbf{J}^k(m)| = l_k(m)\sqrt{N}$, where $l_k(m)$ is $O(1)$. We call $l_k(m)$ the length
of the student weight vector $\mathbf{J}^k(m)$ at the $m$th learning iteration.

The ensemble output of the student networks $u(m)$ is given by the weighted average of
the student network outputs with the weights for averaging $C_k(m)$,

$$u(m) = \sum_{k=1}^{K} C_k(m) u_k(m) = \sum_{k=1}^{K} C_k(m) \mathbf{J}^k(m) \cdot \mathbf{x}(m), \tag{7}$$
$$\sum_{k=1}^{K} C_k(m) = 1. \tag{8}$$

Bagging is a form of ensemble learning using a fixed uniform weight for averaging,
so that $C_k(m) = 1/K$ throughout the learning process, while parallel boosting uses the
weighted average.



B. Learning algorithm 

Learning is defined as a student network modifying its weight vector to make its output
$u_k(m)$ approach the teacher output $v(m)$ for a given input $\mathbf{x}(m)$. We use the gradient
descent algorithm to modify the student's weight vector $\mathbf{J}^k(m)$. An identical input $\mathbf{x}(m)$ is
applied to all student networks in the same order. Therefore, all the student networks can
independently learn the relation between the input $\mathbf{x}(m)$ and the target $v(m)$. The learning
equation is

$$\mathbf{J}^k(m+1) = \mathbf{J}^k(m) + \left( v(m) - u_k(m) \right) \mathbf{x}(m), \tag{9}$$

where $m$ denotes the iteration number. As shown in Eq. (9), the weight $\mathbf{J}^k(m)$ is updated
by using a single input $\mathbf{x}(m)$, which is not used again after the update. This is called
on-line learning. In this formulation, the weight vector $\mathbf{J}^k(m)$ is statistically
independent of a new learning input $\mathbf{x}(m)$, which makes the analysis easier.
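
The on-line update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inputs and weights are drawn from Gaussian distributions (the text only fixes their means and variances), the learning rate is taken to be one, and the function names `draw_input`, `online_update`, and `train_ensemble` as well as the default sizes are hypothetical.

```python
import math
import random


def draw_input(n, rng):
    """Draw x with independent zero-mean elements of variance 1/n, so |x| is close to 1."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def online_update(j, b, x):
    """One on-line step: J <- J + (v - u) x, with v = B.x and u = J.x (unit learning rate assumed)."""
    v = dot(b, x)  # teacher output
    u = dot(j, x)  # student output
    return [ji + (v - u) * xi for ji, xi in zip(j, x)]


def train_ensemble(n=200, k=3, t_max=2.0, seed=0):
    """Train k students on the same input sequence for t_max * n iterations (time t = m / n)."""
    rng = random.Random(seed)
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]  # teacher, fixed during learning
    students = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    for _ in range(int(t_max * n)):
        x = draw_input(n, rng)  # each input is used once and then discarded
        students = [online_update(j, b, x) for j in students]
    return b, students
```

When $|\mathbf{x}| = 1$ exactly, a single update makes the student output for that input equal to the teacher output, which is a quick way to check the sign and scale of the update.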

III. THEORY 

In this paper, we consider the thermodynamic limit of $N \to \infty$ to analyze the dynamics
of the present learning system using statistical mechanics. In the following sections, the
iteration number $m$ is omitted to simplify the notation.

As pointed out, we discuss on-line learning, in which the
input $\mathbf{x}$ is not used again after an update and the weight vector $\mathbf{J}^k$ is statistically independent
of a new learning input. Note that the normalized output of the student
network $\tilde{u}_k = u_k / l_k$ obeys a Gaussian distribution of zero mean and unit variance at the
thermodynamic limit of $N \to \infty$. For the same reason, the teacher output
$v$ obeys a Gaussian distribution of zero mean and unit variance at the thermodynamic
limit. Thus, the joint distribution $P(v, \{\tilde{u}_k\})$ of $v$ and $\{\tilde{u}_k\}$ is




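$$P(v, \{\tilde{u}_k\}) = \frac{1}{(2\pi)^{(K+1)/2} |\Sigma|^{1/2}} \exp\left( -\frac{(v, \{\tilde{u}_k\})^{T} \, \Sigma^{-1} \, (v, \{\tilde{u}_k\})}{2} \right), \tag{10}$$

$$\Sigma = \begin{pmatrix}
1 & R_1 & R_2 & \cdots & R_{K-1} & R_K \\
R_1 & 1 & q_{1,2} & \cdots & q_{1,K-1} & q_{1,K} \\
\vdots & \vdots & \ddots & & \vdots & \vdots \\
R_{K-1} & q_{K-1,1} & \cdots & & 1 & q_{K-1,K} \\
R_K & q_{K,1} & \cdots & & q_{K,K-1} & 1
\end{pmatrix}. \tag{11}$$
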

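Here, $R_k$ is the overlap between the teacher weight vector $\mathbf{B}$ and the student weight vector
$\mathbf{J}^k$ (i.e., the direction cosine of $\mathbf{B}$ and $\mathbf{J}^k$), and $q_{kk'}$ is the overlap between two student
weight vectors $\mathbf{J}^k$ and $\mathbf{J}^{k'}$. These two overlaps are defined as

$$R_k = \frac{\mathbf{B} \cdot \mathbf{J}^k}{|\mathbf{B}| \, |\mathbf{J}^k|} = \frac{1}{N l_k} \sum_{i=1}^{N} B_i J_i^k, \tag{12}$$
$$q_{kk'} = \frac{\mathbf{J}^k \cdot \mathbf{J}^{k'}}{|\mathbf{J}^k| \, |\mathbf{J}^{k'}|} = \frac{1}{N l_k l_{k'}} \sum_{i=1}^{N} J_i^k J_i^{k'}. \tag{13}$$

$R_k$ and $q_{kk'}$ are the order parameters of the present learning system.



A. Generalization error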


The squared error between the teacher output $v$ and the ensemble output of the student networks
$u$ is used to evaluate the student networks' performance.

$$\epsilon = \frac{1}{2} (v - u)^2 = \frac{1}{2} \left( \mathbf{B} \cdot \mathbf{x} - \sum_{k=1}^{K} C_k \mathbf{J}^k \cdot \mathbf{x} \right)^2. \tag{14}$$

Here, $C_k$ is a weight for averaging and the ensemble output of the student networks is a
weighted average of the student networks' outputs. The generalization error $\epsilon_g$ is given by
the squared error $\epsilon$ in Eq. (14) averaged over the possible inputs $\mathbf{x}$ drawn from the Gaussian
distribution $P(\mathbf{x})$ of zero mean and variance $1/N$,

$$\epsilon_g = \int d\mathbf{x} \, P(\mathbf{x}) \, \epsilon. \tag{15}$$

For on-line learning at the thermodynamic limit, as mentioned, the
stochastic variables $\tilde{u}_k = (\mathbf{J}^k \cdot \mathbf{x}) / l_k$ and $v = \mathbf{B} \cdot \mathbf{x}$ obey Gaussian distributions of zero
mean and unit variance. Hence, the average over $\mathbf{x}$ in Eq. (15) can be replaced by an average
over the joint distribution $P(v, \{\tilde{u}_k\})$ of Eq. (10),

$$\epsilon_g = \int dv \prod_{k=1}^{K} d\tilde{u}_k \, P(v, \{\tilde{u}_k\}) \, \frac{1}{2} \left( v - \sum_{k=1}^{K} C_k l_k \tilde{u}_k \right)^2. \tag{16}$$

This equation is a $(K+1)$-dimensional Gaussian integral over $v$ and $\{\tilde{u}_k\}$, so it can be
evaluated in closed form. The result is

$$\epsilon_g = \frac{1}{2} \left\{ 1 - 2 \sum_{k=1}^{K} C_k R_k l_k + \sum_{k=1}^{K} \sum_{k'=1}^{K} C_k C_{k'} q_{kk'} l_k l_{k'} \right\}, \tag{17}$$

where $q_{kk} = 1$. Consequently, the dynamics of the generalization error is calculated by substituting the
values of $l_k$, $R_k$, and $q_{kk'}$ at each time step into Eq. (17). Therefore, we solve the dynamics of $l_k$, $R_k$, and
$q_{kk'}$ in the next subsection. Here, $l_k$, $R_k$, and $q_{kk'}$ are macroscopic parameters that represent
the system dynamics.
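
As a small illustration, Eq. (17) can be evaluated directly once the order parameters are known. The sketch below assumes the overlaps are passed as plain Python lists with ones on the diagonal of `q`; the function name `generalization_error` and the argument layout are hypothetical.

```python
def generalization_error(c, l, r, q):
    """Evaluate Eq. (17) for weights c, lengths l, teacher overlaps r, student overlaps q.

    q is a K x K nested list of student-student overlaps; q[k][k] is assumed to be 1.
    """
    k = len(c)
    # Linear term: sum_k C_k R_k l_k
    linear = sum(c[a] * r[a] * l[a] for a in range(k))
    # Quadratic term: sum_{k,k'} C_k C_k' q_kk' l_k l_k'
    quad = sum(c[a] * c[b] * q[a][b] * l[a] * l[b]
               for a in range(k) for b in range(k))
    return 0.5 * (1.0 - 2.0 * linear + quad)
```

For a single student with $l = 1$ this reduces to $1 - R$, and for two equally weighted students with $R = q$ it gives $\frac{3}{4}(1 - R)$.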



B. Dynamics of order parameters 

We first derive the dynamics of the length of the student weight vector $l_k$. To obtain the
differential equation of $l_k$, we square both sides of Eq. (9) and then average the resulting terms
over the distribution $P(v, \{\tilde{u}_k\})$. Note that $\mathbf{x}$ and $\mathbf{J}^k$ are random variables,
so the equation becomes a random recurrence formula. We formulate the size of the weight
vectors to be $O(\sqrt{N})$, and the size of the input $\mathbf{x}$ is $O(1)$, so the length of the student weight vector
$l_k$ has a self-averaging property. Here, we rewrite $m$ as $m = Nt$, and represent the learning
process using continuous time $t$. We obtain the deterministic differential equation of $l_k$ at
the thermodynamic limit,

$$\frac{dl_k}{dt} = \frac{1 - l_k^2}{2 l_k}. \tag{18}$$

Note that $l_k = 1$ is a stable fixed point of this equation. Next, we derive the differential
equation of the overlap $R_k$ between the teacher weight vector $\mathbf{B}$ and the student weight
vector $\mathbf{J}^k$. It is derived by taking the inner product
of $\mathbf{B}$ with Eq. (9), and then averaging over the distribution
$P(v, \{\tilde{u}_k\})$. The overlap $R_k$ also has a self-averaging property, and at the thermodynamic
limit, the differential equation of $R_k$ is obtained through a calculation similar to that
used for $l_k$,

$$\frac{dR_k}{dt} = \frac{1}{l_k} - \frac{R_k}{2} \left( 1 + \frac{1}{l_k^2} \right). \tag{19}$$

To calculate the generalization error in Eq. (17), we have to obtain the differential equation
of the overlap $q_{kk'}$ defined by Eq. (13). The overlap $q_{kk'}$ also has a self-averaging property, so
we can derive the differential equation at the thermodynamic limit. Since
$q_{kk'} = (\mathbf{J}^k \cdot \mathbf{J}^{k'}) / (N l_k l_{k'})$, the differential equation is derived by taking the inner product of Eq.
(9) for $\mathbf{J}^k$ with the same equation for $\mathbf{J}^{k'}$, and we obtain the deterministic differential equation
below.

$$\frac{dq_{kk'}}{dt} = \frac{1}{l_k l_{k'}} - \frac{q_{kk'}}{2} \left( \frac{1}{l_k^2} + \frac{1}{l_{k'}^2} \right). \tag{20}$$

Equations (18), (19), and (20) form a set of simultaneous differential equations.
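
The order-parameter dynamics can also be followed numerically. The sketch below integrates Eqs. (18) to (20) with a simple forward Euler scheme for the symmetric case $l_k = l_{k'} = l$, in which Eq. (20) reduces to $dq/dt = (1 - q)/l^2$. The Euler scheme, the step size, and the function names `step` and `integrate` are choices made here for illustration and are not part of the analysis above.

```python
def step(l, r, q, dt):
    """Advance (l, R, q) by one forward Euler step of size dt."""
    dl = (1.0 - l * l) / (2.0 * l)                     # Eq. (18)
    dr = 1.0 / l - 0.5 * r * (1.0 + 1.0 / (l * l))     # Eq. (19)
    dq = (1.0 - q) / (l * l)                           # Eq. (20) with l_k = l_k' = l
    return l + dt * dl, r + dt * dr, q + dt * dq


def integrate(l0, r0, q0, t_end, dt=1e-3):
    """Integrate from t = 0 to t_end and return the final (l, R, q)."""
    l, r, q = l0, r0, q0
    for _ in range(int(round(t_end / dt))):
        l, r, q = step(l, r, q, dt)
    return l, r, q
```

Starting from $l = 1$, the length stays at its fixed point and $R$ and $q$ relax exponentially toward one; starting from another length, for example `integrate(2.0, 0.0, 0.0, 10.0)`, the length itself relaxes toward one.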



IV. RESULTS 



A. Homogeneous correlation of initial weight vectors 

Each component of the independently drawn teacher weight vector $\mathbf{B}$ is as described in Sec. II A,
and in this section each component $J_i^k$ ($k = 1, \ldots, K$) of the student weight vectors is initialized by being
drawn from independent random variables with zero mean and unit variance.
In this case, the initial values of the order parameters are

$$l_k(0) = 1, \quad R_k(0) = 0, \quad q_{kk'}(0) = 0 \;\; (k \neq k'). \tag{21}$$
From the symmetry of the evolution equation for updating the weight vector,

$$l_k(t) = l(t), \quad R_k(t) = R(t), \quad q_{kk'}(t) = q(t) \tag{22}$$

are obtained. The dynamics of the order parameters $l(t)$, $R(t)$, and $q(t)$ are derived as Eqs.
(23), (24), and (25) by substituting the above conditions into Eqs. (18), (19), and (20):

$$\frac{dl(t)}{dt} = \frac{1 - l(t)^2}{2 l(t)}, \tag{23}$$
$$\frac{dR(t)}{dt} = 1 - R(t), \tag{24}$$
$$\frac{dq(t)}{dt} = 1 - q(t). \tag{25}$$

Note that $l(t) = 1$ is the stable fixed point of the dynamics of $l(t)$ and is given by solving
$dl/dt = 0$; Eqs. (24) and (25) are written for $l(t) = 1$.

Next, we can easily solve the above equations analytically.

$$l(t) = 1, \tag{26}$$
$$R(t) = 1 - \exp(-t), \tag{27}$$
$$q(t) = 1 - \exp(-t). \tag{28}$$

From Eqs. (27) and (28), $q(t) = R(t)$. Since the order parameters do not depend on the student index $k$, the
optimal weights for averaging should be $C_k = 1/K$. By substituting Eqs. (26), (27), and
(28) into Eq. (17), the generalization error for $K$ student networks is rewritten as

$$\epsilon_g^K(t) = \frac{1}{2} \left\{ 1 - 2 R(t) l(t) + \frac{l(t)^2}{K} + \frac{K-1}{K} q(t) l(t)^2 \right\} \tag{29}$$
$$= \frac{1}{2} \left\{ \frac{1 - R(t)}{K} + (1 - R(t)) \right\} \tag{30}$$
$$= \frac{1}{2} \left\{ \frac{\exp(-t)}{K} + \exp(-t) \right\}. \tag{31}$$

$\epsilon_g^K(t)$ denotes the generalization error with $K$ student networks. The first term on the right
side of this equation depends on the number of student networks $K$ and becomes negligible
when $K$ goes to infinity. The second term does not depend on $K$, so it remains and cannot
be ignored. Substituting $K = 1$ into Eq. (30), we find that the generalization error for a
single student network is $\epsilon_g^1(t) = 1 - R(t)$, and this error is identical to the generalization
error of a simple perceptron. From Eq. (30), when $K$ becomes infinite, the generalization
error of an ensemble of $K$ student networks asymptotically converges to $(1 - R(t))/2$. Hence,
the generalization error of an ensemble of $K$ student networks converges to half that of a
single student network in the limit of $K$ going to infinity. Note that Eqs. (30) and (31)
describe the dynamics of the generalization error of the ensemble because $l(t) = 1$ is the fixed
point of the learning process under the initialization given in this
subsection.

FIG. 2: Dependence of the ensemble learning generalization error on the number of student
networks K. (a) Theoretical results. (b) Results obtained through a numerical simulation.
Horizontal axis: time t = m/N.

Figure 2 shows the $K$ dependence of the generalization error of the ensemble: (a) shows
theoretical results obtained using Eq. (31), and (b) shows the results obtained through a
numerical simulation. In these figures, the horizontal axis is time $t = m/N$, and the unit time
corresponds to the time needed to feed in $N$ inputs $\mathbf{x}$. The vertical axis is the generalization
error $\epsilon_g$.

First, the theoretical results for $K = 1, 3, 10$, and $10000$ are shown in Fig. 2(a) as the
diagonal lines. As $K$ increases (the lower lines), the generalization error shifts downward and converges
to half of the generalization error for $K = 1$ as $O(1/K)$. Next, the
simulation results for $K = 1, 3$, and $10$ are shown in Fig. 2(b). These results agree with the
theoretical results, confirming the validity of the theoretical results. Hence, in the following
analysis, we show only the theoretical results.
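
The homogeneous result of Eq. (31) is simple enough to tabulate directly. The following sketch, with the hypothetical helper names `homogeneous_error` and `ratio_to_single`, shows how the error of a $K$-student ensemble compares with that of a single student.

```python
import math


def homogeneous_error(t, k):
    """Eq. (31): generalization error of K students with homogeneous initial weights."""
    return 0.5 * (math.exp(-t) / k + math.exp(-t))


def ratio_to_single(k, t=1.0):
    """Ratio of the K-student error to the single-student error at time t."""
    return homogeneous_error(t, k) / homogeneous_error(t, 1)
```

The ratio equals $(1 + 1/K)/2$ at every time $t$, so it is 0.55 for $K = 10$ and approaches one half as $K$ grows.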



B. Inhomogeneous correlation of initial weight vectors 

When the correlation between student weight vectors is inhomogeneous, for instance, if
the number of student networks is $K = 3$, the initial student weight vectors satisfy $\mathbf{J}^1(0) = \mathbf{J}^2(0)$,
and $\mathbf{J}^3(0)$ is independent of $\mathbf{J}^1(0) = \mathbf{J}^2(0)$, it seems natural to select weights for averaging
of $C_1 = C_2 = 0.25$ and $C_3 = 0.5$, instead of using the uniform value of $C_1 = C_2 = C_3 = 1/3$.
This consideration suggests that the optimal generalization error can be obtained by
using a weighted average of the student outputs when the correlation between students $q_{kk'}$
is inhomogeneous. This method is called "parallel boosting".

Because the student weight vectors are inhomogeneous, the overlaps $R_k$ and $q_{kk'}$ differ
with the subscripts $k$ and $k'$, and the weights for averaging $C_k$ will depend on $R_k$ and $q_{kk'}$ to reduce
the generalization error. We assume that the initial length of the student weight vector is $l_k(0) = 1$,
so it is identical for every subscript $k$.

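The optimal weights $C_k$ satisfy the following condition.

$$\frac{\partial \epsilon_g}{\partial C_k} = 0. \tag{32}$$

Since $\epsilon_g$ is a quadratic (second-order) function with respect to $C_k$, Eq. (32) becomes a linear
equation, and we can easily obtain the optimal $C_k$ when the order parameters $l_k(t)$, $R_k(t)$, and
$q_{kk'}(t)$ are given. The order parameters $R_k(t)$ and $q_{kk'}(t)$ are time dependent, so the
optimal weights for averaging $C_k$ given by Eq. (32) generally become time dependent, $C_k(t)$.
In this case, we assumed $l_k(0) = 1$, which then means $l_k(t) = 1$ because $l_k = 1$ is a fixed point of Eq. (18). Substituting
$l_k(t) = 1$ and $C_K(t) = 1 - \sum_{k=1}^{K-1} C_k(t)$, which follows from $\sum_{k=1}^{K} C_k(t) = 1$, into Eq. (17), we obtain the generalization error as

$$\begin{aligned}
\epsilon_g = {} & 1 - \sum_{k=1}^{K-1} C_k(t) \left( R_k(t) - R_K(t) \right) - R_K(t) \\
& + \sum_{k=1}^{K-1} C_k(t)^2 \left( 1 - q_{kK}(t) \right) - \sum_{k=1}^{K-1} C_k(t) \left( 1 - q_{kK}(t) \right) \\
& + \sum_{k=1}^{K-1} \sum_{k'=k+1}^{K-1} C_k(t) C_{k'}(t) \left( 1 + q_{kk'}(t) - q_{kK}(t) - q_{k'K}(t) \right).
\end{aligned} \tag{33}$$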
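
Because the optimality condition is linear in the weights, any linear solver can be used to find them. The sketch below is one possible way to do this for $l_k = 1$. Instead of eliminating $C_K$ as in Eq. (33), it enforces $\sum_k C_k = 1$ with a Lagrange multiplier, which is an equivalent formulation chosen here for brevity rather than the one used in the text. The helper names `solve_linear` and `optimal_weights` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: optimal weights are not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def optimal_weights(r, q):
    """Minimize Eq. (17) with l_k = 1 over C subject to sum_k C_k = 1.

    Stationarity gives sum_k' q_kk' C_k' + lam = R_k for every k, plus the constraint row.
    """
    k = len(r)
    a = [list(q[i]) + [1.0] for i in range(k)] + [[1.0] * k + [0.0]]
    b = list(r) + [1.0]
    return solve_linear(a, b)[:k]
```

For example, with $R_k = 0$, $q_{12} = 0.5$, and a third student uncorrelated with the other two, this gives weights $(2/7, 2/7, 3/7)$. When two students are identical ($q_{12} = 1$) only the sum of their weights is determined, and the sketch reports this by raising `ValueError`.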



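With $l_k(t) = 1$, Eqs. (19) and (20) can then be solved analytically.

$$R_k(t) = 1 - (1 - R_k(0)) \exp(-t), \tag{34}$$
$$q_{kk'}(t) = 1 - (1 - q_{kk'}(0)) \exp(-t), \tag{35}$$

where $R_k(0)$ and $q_{kk'}(0)$ are the initial values of $R_k(t)$ and $q_{kk'}(t)$, respectively. The dynamics
of the generalization error is given by substituting Eqs. (34) and (35) into Eq. (33),

$$\begin{aligned}
\epsilon_g = \exp(-t) \Bigg\{ & 1 - R_K(0) - \sum_{k=1}^{K-1} C_k(t) \left( R_k(0) - R_K(0) \right) \\
& + \sum_{k=1}^{K-1} C_k(t)^2 \left( 1 - q_{kK}(0) \right) - \sum_{k=1}^{K-1} C_k(t) \left( 1 - q_{kK}(0) \right) \\
& + \sum_{k=1}^{K-1} \sum_{k'=k+1}^{K-1} C_k(t) C_{k'}(t) \left( 1 + q_{kk'}(0) - q_{kK}(0) - q_{k'K}(0) \right) \Bigg\}.
\end{aligned} \tag{36}$$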

As we mentioned, the optimal weights for averaging $C_k$ generally depend on time $t$ because $\epsilon_g$
is a function of time $t$. However, in this case, time enters Eq. (36) only through the common factor
$\exp(-t)$, and the expression in braces depends only on the initial values of the order
parameters $R_k(0)$ and $q_{kk'}(0)$. The condition $\partial \epsilon_g / \partial C_k = 0$ is therefore also independent of time $t$.
Thus the optimal weights for averaging $C_k$ do not depend on time $t$ in this case.

Figure 3 shows the ratio of the generalization error with and without parallel boosting
for $K = 3$ student networks. Two of the student networks were identical ($\mathbf{J}^1(0) = \mathbf{J}^2(0)$),
while $\mathbf{J}^3(0)$ was independent of the others. The generalization error using parallel boosting
is denoted as $\epsilon_g^{PB}$, and that using bagging is denoted as $\epsilon_g^{B}$ in this figure. As the figure
shows, parallel boosting is effective and the ratio of the generalization error with and without
parallel boosting was the same throughout the learning process. The ratio $\epsilon_g^{PB} / \epsilon_g^{B}$ was
about 0.96.
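
The three-student example can be checked with a short calculation. The sketch below evaluates Eq. (17) with $l_k = 1$ and the closed-form overlaps of Eqs. (34) and (35), once with uniform weights and once with the weights $C_1 = C_2 = 0.25$, $C_3 = 0.5$. It assumes $R_k(0) = 0$, that is, initial student weights uncorrelated with the teacher, which the text does not state explicitly for this example; the names `error_at` and `boosting_ratio` are hypothetical.

```python
import math


def error_at(t, c, r0, q0):
    """Eq. (17) with l_k = 1, using the closed-form overlaps of Eqs. (34) and (35)."""
    k = len(c)
    decay = math.exp(-t)
    r = [1.0 - (1.0 - r0[a]) * decay for a in range(k)]
    q = [[1.0 - (1.0 - q0[a][b]) * decay for b in range(k)] for a in range(k)]
    linear = sum(c[a] * r[a] for a in range(k))
    quad = sum(c[a] * c[b] * q[a][b] for a in range(k) for b in range(k))
    return 0.5 * (1.0 - 2.0 * linear + quad)


# Assumed initial conditions: students 1 and 2 identical, student 3 independent,
# and no initial overlap with the teacher.
R0 = [0.0, 0.0, 0.0]
Q0 = [[1.0, 1.0, 0.0],
      [1.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
BAGGING = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
BOOSTED = [0.25, 0.25, 0.5]


def boosting_ratio(t):
    """Ratio of the error with weighted averaging to the error with uniform averaging."""
    return error_at(t, BOOSTED, R0, Q0) / error_at(t, BAGGING, R0, Q0)
```

Under these assumptions the ratio is $27/28 \approx 0.964$ at every time $t$, in line with the value of about 0.96 quoted above.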

FIG. 3: Comparison of the generalization error with and without parallel boosting.

V. CONCLUSION

We have analyzed the generalization error of an ensemble of linear perceptrons within
the framework of on-line learning. Weights for averaging were introduced. We then derived
simultaneous differential equations for the order parameters $l_k$, $R_k$, and $q_{kk'}$ to calculate the
generalization error. Here, $l_k$ was the length of the student weight vector $\mathbf{J}^k$, $R_k$ was the
overlap between the teacher weight vector $\mathbf{B}$ and the student weight vector $\mathbf{J}^k$, and $q_{kk'}$ was
the overlap between two student weight vectors $\mathbf{J}^k$ and $\mathbf{J}^{k'}$. We have assumed two initial
conditions to calculate the generalization error of the ensemble of linear perceptrons: one
was that the correlation between the weight vectors of linear perceptrons was homogeneous,
and the other was that the correlation was inhomogeneous.

In the homogeneous case, simple averaging over the $K$ outputs of the linear perceptrons
was valid to obtain the ensemble output of the linear perceptrons. We found that the
generalization error was equal to half that of a single linear perceptron when the number of
linear perceptrons $K$ became infinite, and that the generalization error converged to that
of the infinite case as $O(1/K)$ when the number of linear perceptrons was finite.

In the inhomogeneous case, the generalization error was improved by introducing
weights for averaging over the $K$ outputs of the linear perceptrons. The order parameters $R_k$
and $q_{kk'}$ are time dependent, so one might expect the weights for averaging over the $K$ outputs
of the linear perceptrons to be time dependent as well. However, we found that the weights were
not time dependent when the initial value of the weight length was $l_k(0) = 1$, and that they
depended only on the initial correlation between the weight vectors of the linear perceptrons. We
also carried out numerical simulations whose results agreed with the theoretical results, thus
confirming the validity of the theoretical analysis.